Lesson 6

Welcome

Notes: In this lesson, Solomon will be using a linear regression model to predict diamond price using other variables in the diamonds dataset. If you are not familiar with linear regression, you may want to take a break and go through Lesson 15 of Udacity’s Intro to Inferential Statistics course (https://classroom.udacity.com/courses/ud201), which covers linear regression. When you’re done, you’ll be ready to come back and apply your new knowledge to the diamonds dataset! ***

Scatterplot Review

Let’s start by examining two variables in the data set. The scatterplot is a powerful tool to help you understand the relationship between two continuous variables.

We can quickly see if the relationship is linear or not. In this case, we can use a variety of diamond characteristics to help us figure out whether the price advertised for any given diamond is reasonable or a rip-off.

Let’s consider the price of a diamond and it’s carat weight. Create a scatterplot of price (y) vs carat weight (x).

Limit the x-axis and y-axis to omit the top 1% of values.

Hint: Use the quantile() function inside of xlim and ylim to omit the top 1% of values for each variable.

library(ggplot2)
ggplot(aes(x = carat, y = price), data = subset(diamonds, price < quantile(price, probs = .99) & carat < quantile(carat, probs = .99))) +
  geom_point(fill = I('#F79420'), color = I('black'), shape = 21)

qplot(data = diamonds, x = carat, y = price,
      xlim = c(0, quantile(diamonds$carat, 0.99)),
      ylim = c(0, quantile(diamonds$price, 0.99))) +
  geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).

## Warning: Removed 926 rows containing missing values (geom_point).

ggplot(aes(x = carat, y = price), data = diamonds) +
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) +
  geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).


Price and Carat Relationship

Response:

ggplot(aes(x = carat, y = price), data = diamonds) +
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) +
  geom_point(fill = I('#F79420'), color = I('black'), shape = 21, alpha = 1/4) +
  stat_smooth(method = 'lm')
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).


Frances Gerety

Notes:

A diamonds is


The Rise of Diamonds

Notes: The Counte of Monte Cristo (https://www.youtube.com/watch?v=PJyRNeCErNE) and Knocked Up (https://www.youtube.com/watch?v=Be3sZKmfb9g) are two movies that have proposals without engagement rings.

These are the only two movies that came to mind…after A LOT of time. Can you think of any others? If so, share it in the discussions! ***

ggpairs Function

Notes: You can click on the packages tab in RStudio to determine which packages have been installed.

You may receive a message when installing the new packages. If so, click cancel, clear your workspace, and try installing the packages again.

In this video, Solomon works with the plyr package. We worked with the dplyr package to manipulate data frames and to create new ones throughout the course. dplyr is the latest version of plyr that is specifically for working with data frames.

Similarly, we worked with the reshape2 package, which is the newest version of the reshape package.

ggpairs output (https://s3.amazonaws.com/udacity-hosted-downloads/ud651/ggpairs_landscape.pdf) When you duplicate the plot matrix on your local machine, you may want to add a axisLabels = ‘internal’ argument to your ggpairs function call to have the variable names on the diagonal of the matrix rather than on the outside.

# install these if necessary
#install.packages('GGally') # for multiple particular plot
#install.packages('scales') # for a variey of things
#install.packages('memisc') # to summarize regression
#install.packages('lattice') # 
#install.packages('MASS') # for various functions  
#install.packages('car') # to recode variables
#install.packages('reshape2') # 
#install.packages('plyr') # to create interesting summaries and transmissions

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Warning: package 'memisc' was built under R version 3.4.2
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]

ggpairs(diamond_samp, #axisLabels = 'internal',
  lower = list(continuous = wrap("points", shape = I('.'))),
  upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are some things you notice in the ggpairs output? Response:


The Demand of Diamonds

Notes: Create two histograms of the price variable and place them side by side on one output image.

We’ve put some code below to get you started.

The first plot should be a histogram of price and the second plot should transform the price variable using log10.

Set appropriate bin widths for each plot. ggtitle() will add a title to each histogram.

You can self-assess your work with the plots in the solution video.

library(gridExtra)

plot1 <- ggplot(aes(x = price), data = diamonds) + 
  geom_histogram(binwidth = 100, fill = I('#099DD9')) +
  ggtitle('Price')

plot2 <- ggplot(aes(x = price), data = diamonds) +
  geom_histogram(binwidth = 0.01, fill = I('#F79420')) +
  scale_x_log10() +
  ggtitle('Price (log10)')

grid.arrange(plot1, plot2, ncol=2)


Connecting Demand and Price Distributions

Notes: When looking at these plots, what do you notice? Think specifically about the two peaks in the transformed plot and how it relates to the demand for diamonds.

Answer: There are two peaks in log scale plot, while in the normal plot I cannot distingush shuch peask. ***

Scatterplot Transformation

Basic Structure of a Function (https://www.youtube.com/watch?v=Z1wB1rHAYzQ&list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP)

ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  scale_y_continuous(trans = log10_trans()) +
  ggtitle('Price (log10) by Carat')

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = 1/2, size = 0.75, position = 'jitter') + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Overplotting Revisited

head(sort(table(diamonds$carat), decreasing = T))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).


Other Qualitative Factors

Notes: What makes a diamond sparkle? (http://www.bluenile.com/diamonds/diamond-cut)

Does clarity affect the sparkle? (http://www.bluenile.com/diamonds/diamond-clarity)

What’s a jeweler’s loupe? (http://en.wikipedia.org/wiki/Loupe)


Price vs. Carat and Clarity

Adjust the code below to color the points by clarity.

A layer called scale_color_brewer() has been added to adjust the legend and provide custom colors.

See if you can figure out what it does. Links to resources are in the Instructor Notes.

You will need to install the package RColorBrewer in R to get the same colors and color palettes.

# install and load the RColorBrewer package
#install.packages('RColorBrewer', dependencies = TRUE)
library(RColorBrewer)

ggplot2: scale_colour_brewer ggplot2: Color Brewer Palettes and Safe Colors/#palettes-color-brewer) ggplot2: Legends/)

override.aes() change legend symols to be larger or more be transparent than the plot’s ones.

ggplot(aes(x = carat, y = price, colour = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price

Response:


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Clarity', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price

Response:


Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color', reverse = FALSE,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price

Response:


Linear Models in R

Notes:

Response:


Building the Linear Model

Notes: Linear Models and Operators in R (http://data.princeton.edu/R/linearModels.html)

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
#mtable(m1, m2, m3, m4, m5)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ============================================================================================
##                        m1             m2             m3             m4            m5        
## --------------------------------------------------------------------------------------------
##   (Intercept)          2.821***       1.039***       0.874***      0.932***       0.415***  
##                       (0.006)        (0.019)        (0.019)       (0.017)        (0.010)    
##   I(carat^(1/3))       5.558***       8.568***       8.703***      8.438***       9.144***  
##                       (0.007)        (0.032)        (0.031)       (0.028)        (0.016)    
##   carat                              -1.137***      -1.163***     -0.992***      -1.093***  
##                                      (0.012)        (0.011)       (0.010)        (0.006)    
##   cut: .L                                            0.224***      0.224***       0.120***  
##                                                     (0.004)       (0.004)        (0.002)    
##   cut: .Q                                           -0.062***     -0.062***      -0.031***  
##                                                     (0.004)       (0.003)        (0.002)    
##   cut: .C                                            0.051***      0.052***       0.014***  
##                                                     (0.003)       (0.003)        (0.002)    
##   cut: ^4                                            0.018***      0.018***      -0.002     
##                                                     (0.003)       (0.002)        (0.001)    
##   color: .L                                                       -0.373***      -0.441***  
##                                                                   (0.003)        (0.002)    
##   color: .Q                                                       -0.129***      -0.093***  
##                                                                   (0.003)        (0.002)    
##   color: .C                                                        0.001         -0.013***  
##                                                                   (0.003)        (0.002)    
##   color: ^4                                                        0.029***       0.012***  
##                                                                   (0.003)        (0.002)    
##   color: ^5                                                       -0.016***      -0.003*    
##                                                                   (0.003)        (0.001)    
##   color: ^6                                                       -0.023***       0.001     
##                                                                   (0.002)        (0.001)    
##   clarity: .L                                                                     0.907***  
##                                                                                  (0.003)    
##   clarity: .Q                                                                    -0.240***  
##                                                                                  (0.003)    
##   clarity: .C                                                                     0.131***  
##                                                                                  (0.003)    
##   clarity: ^4                                                                    -0.063***  
##                                                                                  (0.002)    
##   clarity: ^5                                                                     0.026***  
##                                                                                  (0.002)    
##   clarity: ^6                                                                    -0.002     
##                                                                                  (0.002)    
##   clarity: ^7                                                                     0.032***  
##                                                                                  (0.001)    
## --------------------------------------------------------------------------------------------
##   R-squared            0.924          0.935          0.939         0.951          0.984     
##   adj. R-squared       0.924          0.935          0.939         0.951          0.984     
##   sigma                0.280          0.259          0.250         0.224          0.129     
##   F               652012.063     387489.366     138654.523     87959.467     173791.084     
##   p                    0.000          0.000          0.000         0.000          0.000     
##   Log-likelihood   -7962.499      -3631.319      -1837.416      4235.240      34091.272     
##   Deviance          4242.831       3613.360       3380.837      2699.212        892.214     
##   AIC              15930.999       7270.637       3690.832     -8442.481     -68140.544     
##   BIC              15957.685       7306.220       3761.997     -8317.942     -67953.736     
##   N                53940          53940          53940         53940          53940         
## ============================================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Video Notes:

Let’s put our model in a larger context. Assuming that the data is not somehow corrupted and we are not egregiously violating some of the key assumptions of linear regression (for example, violating the IID assumption by having a bunch of duplicated observations in our data set), what could be some problems with this model? What else should we think about when using this model?

Take your time to answer this question, do some qualitative research about the diamond market. See the links below to get started.

Diamond Prices over the Years (http://www.pricescope.com/diamond-prices/diamond-prices-chart)

Global Diamond Report (http://www.bain.com/publications/articles/global-diamond-report-2013.aspx)

Falling Supply and Rising Demand: Couples in Shanghai take to the Ring (http://diamonds.blogs.com/diamonds_update/diamond-prices/)

If you’d like to learn more about linear models and how to interpret regression coefficients, please refer to the following articles.

Interpreting Regression Coefficients in R on R Bloggers (http://www.r-bloggers.com/interpreting-regression-coefficient-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29)

Interpreting Regression Coefficients on the Analysis Factor blog (http://www.theanalysisfactor.com/interpreting-regression-coefficients/)

Fitting and Interpreting Linear Models by ŷhat (http://blog.yhat.com/posts/r-lm-summary.html)

Another Explanation of Factor Coefficients in Linear Models on Stats StackExchange http://stats.stackexchange.com/a/24256)

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response:


A Bigger, Better Data Set

Notes: Your task is to build five linear models like Solomon did for the diamonds data set only this time you’ll use a sample of diamonds from the diamonds big data set.

Be sure to make use of the same variables (logprice, carat, etc.) and model names (m1, m2, m3, m4, m5).

To get the diamonds big data into RStudio on your machine, copy, paste, and run the code in the Instructor Notes. There’s 598,024 diamonds in this data set!

Since the data set is so large, you are going to use a sample of the data set to compute the models. You can use the entire data set on your machine which will produce slightly different coefficients and statistics for the models.

Your task is to write the code to create the models.

#install.packages('bitops')
#install.packages('RCurl')
#library('bitops')
#library('RCurl')

#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))

load("BigDiamonds.rda")
diamondsbig$logprice <- log(diamondsbig$price)
m1 <- lm(logprice ~I(carat^(1/3)),
         data = diamondsbig[diamondsbig$price < 10000 &
                              diamondsbig$cert == "GIA",])
#m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##         "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##     "GIA", ])
## 
## ================================================================================================
##                         m1              m2             m3             m4              m5        
## ------------------------------------------------------------------------------------------------
##   (Intercept)           2.671***        1.333***       0.949***       0.529***       -0.464***  
##                        (0.003)         (0.012)        (0.012)        (0.010)         (0.009)    
##   I(carat^(1/3))        5.839***        8.243***       8.633***       8.110***        8.320***  
##                        (0.004)         (0.022)        (0.021)        (0.017)         (0.012)    
##   carat                                -1.061***      -1.223***      -0.782***       -0.763***  
##                                        (0.009)        (0.009)        (0.007)         (0.005)    
##   cut: V.Good                                          0.120***       0.090***        0.071***  
##                                                       (0.002)        (0.001)         (0.001)    
##   cut: Ideal                                           0.211***       0.181***        0.131***  
##                                                       (0.002)        (0.001)         (0.001)    
##   color: K/L                                                          0.123***        0.117***  
##                                                                      (0.004)         (0.003)    
##   color: J/L                                                          0.312***        0.318***  
##                                                                      (0.003)         (0.002)    
##   color: I/L                                                          0.451***        0.469***  
##                                                                      (0.003)         (0.002)    
##   color: H/L                                                          0.569***        0.602***  
##                                                                      (0.003)         (0.002)    
##   color: G/L                                                          0.633***        0.665***  
##                                                                      (0.003)         (0.002)    
##   color: F/L                                                          0.687***        0.723***  
##                                                                      (0.003)         (0.002)    
##   color: E/L                                                          0.729***        0.756***  
##                                                                      (0.003)         (0.002)    
##   color: D/L                                                          0.812***        0.827***  
##                                                                      (0.003)         (0.002)    
##   clarity: I1                                                                         0.301***  
##                                                                                      (0.006)    
##   clarity: SI2                                                                        0.607***  
##                                                                                      (0.006)    
##   clarity: SI1                                                                        0.727***  
##                                                                                      (0.006)    
##   clarity: VS2                                                                        0.836***  
##                                                                                      (0.006)    
##   clarity: VS1                                                                        0.891***  
##                                                                                      (0.006)    
##   clarity: VVS2                                                                       0.935***  
##                                                                                      (0.006)    
##   clarity: VVS1                                                                       0.995***  
##                                                                                      (0.006)    
##   clarity: IF                                                                         1.052***  
##                                                                                      (0.006)    
## ------------------------------------------------------------------------------------------------
##   R-squared             0.888           0.892          0.899          0.937           0.969     
##   adj. R-squared        0.888           0.892          0.899          0.937           0.969     
##   sigma                 0.289           0.284          0.275          0.216           0.154     
##   F               2700903.714     1406538.330     754405.425     423311.488      521161.443     
##   p                     0.000           0.000          0.000          0.000           0.000     
##   Log-likelihood   -60137.791      -53996.269     -43339.818      37830.414      154124.270     
##   Deviance          28298.689       27291.534      25628.285      15874.910        7992.720     
##   AIC              120281.582      108000.539      86691.636     -75632.827     -308204.540     
##   BIC              120313.783      108043.473      86756.037     -75482.557     -307968.400     
##   N                338946          338946         338946         338946          338946         
## ================================================================================================

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:


Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)

exp(modelEstimate)
##        fit     lwr      upr
## 1 5040.436 3730.34 6810.638

The prediction interval here may be slightly conservative, as the model errors are heteroskedastic over carat (and hence price) even after our log and cube-root transformations.

See the output of the following code.

dat = data.frame(m4$model, m4$residuals)

with(dat, sd(m4.residuals))
## [1] 0.2164168
with(subset(dat, carat > .9 & carat < 1.1), sd(m4.residuals))
## [1] 0.2178147
dat$resid <- as.numeric(dat$m4.residuals)
ggplot(aes(y = resid, x = round(carat, 2)), data = dat) +
  geom_line(stat = "summary", fun.y = sd)
## Warning: Removed 1 rows containing missing values (geom_path).

How could we do better? If we care most about diamonds with carat weights between 0.50 and 1.50, we might restrict the data we use to fit our model to diamonds that are that size - we have enough data.

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.


Final Thoughts

Notes: Tiffany vs. Costco: Which Diamond Ring is Better (http://www.businessweek.com/articles/2013-05-06/tiffany-vs-dot-costco-which-diamond-ring-is-better)

But Costco Sells Pricy Diamonds Too (http://www.costco.com/CatalogSearch?catalogId=10701&langId=-1&keyword=Engagement&storeId=10301&refine=30108%2b209%2b238%2b210%2b) ***

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!